
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving.

Li, Zhuohan; Zheng, Lianmin; Zhong, Yinmin; Liu, Vincent; Sheng, Ying; Jin, Xin; Huang, Yanping; Chen, Zhifeng; Zhang, Hao; Gonzalez, Joseph; et al. (July 2023, USENIX Association)

Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster. Evaluation results on production workloads show that AlpaServe can process requests at up to 10× higher rates or 6× more burstiness while staying within latency constraints for more than 99% of requests.
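
The trade-off described in the abstract can be illustrated with a toy queueing sketch. This is not AlpaServe's placement algorithm. It is a minimal model under stated assumptions: two devices, a burst of requests for one model while the other model is idle, a fixed per-request service time, and a hypothetical 10% overhead for splitting the model into pipeline stages. The function names and all numbers are illustrative and do not come from the paper.

```python
def dedicated_latencies(arrivals, service_time):
    """Latency of each request when the whole model sits on one device (FIFO)."""
    free_at = 0.0
    latencies = []
    for t in arrivals:
        start = max(t, free_at)          # wait until the device is free
        free_at = start + service_time
        latencies.append(free_at - t)
    return latencies


def pipelined_latencies(arrivals, service_time, num_stages, overhead):
    """Latency of each request when the model is split into pipeline stages,
    one stage per device. `overhead` is an assumed fractional slowdown
    caused by model parallelism (hypothetical value, not measured)."""
    stage_time = service_time / num_stages * (1 + overhead)
    stage_free_at = [0.0] * num_stages
    latencies = []
    for t in arrivals:
        ready = t
        for s in range(num_stages):
            start = max(ready, stage_free_at[s])  # wait for this stage's device
            ready = start + stage_time
            stage_free_at[s] = ready
        latencies.append(ready - t)
    return latencies


# A burst of four simultaneous requests for model A; model B is idle.
burst = [0.0, 0.0, 0.0, 0.0]
dedicated = dedicated_latencies(burst, service_time=1.0)
pipelined = pipelined_latencies(burst, service_time=1.0, num_stages=2, overhead=0.1)

print("dedicated mean latency:", sum(dedicated) / len(dedicated))
print("pipelined mean latency:", round(sum(pipelined) / len(pipelined), 3))
```

In this toy setup a lone request is slower under model parallelism (1.1 versus 1.0 time units, because of the assumed overhead), yet the burst finishes sooner on average (about 1.925 versus 2.5), because the burst can borrow the device that the idle model would otherwise leave unused.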

